ABSTRACT

The Netherlands' secondary education system is highly 
differentiated, with four different school types for four scholastic 
ability levels. Final examinations must accommodate these four 
levels, and require a test-independent definition of the intended 
final ability levels as well as a sample-free evaluation of the range 
of ability levels at which a particular test will measure with 
sufficient accuracy. A method for locating, on a single scale, the 
ability distribution of a given population as well as the test's 
optimal reliability level is proposed and illustrated. The method is 
demonstrated using standardized tests of foreign language listening 
comprehension for the two highest levels of secondary education in 
the Netherlands. The languages tested are French, German, and 
English. Statistical analyses of the test results for each group are 
presented and discussed, and implications for test use are examined. 
(MSE)

1 Introduction

Secondary education in the Netherlands is a highly differentiated system. Four
different school types aim at four different scholastic ability levels (Fig. 1).
In such a system the final examinations must match these different levels, and
they therefore require a test-independent definition of the intended final
ability levels as well as a sample-free evaluation of the range of ability
levels at which a particular test will measure with sufficient accuracy. The
present paper presents a method to locate on a single scale the ability
distribution of a given population as well as the level at which a particular
test will yield optimally reliable estimates of the ability of a person
taking the test. The method can therefore be used to evaluate the fitness of a
test for pass/fail decisions in a particular population and, where necessary,
to adapt pretests to that level.

The fitness of a test is a matter of judgement, as it is determined by weighing
reliability against efficiency. The discriminatory power required of a test is
a function of the range in ability within the group to be tested and the
importance of decisions based on the results. In the Netherlands high
discriminatory power is demanded, as the ability continuum present in secondary
education is subdivided into different school types, whereas within schools
grade marks are distributed on a ten point scale (which in fact is a hundred
point scale, as marks are given to one decimal) and pass/fail decisions are
based on these marks. To achieve high discriminatory power at a certain ability
level an item has to be of the appropriate difficulty. Any item that is too
difficult or too easy will not discriminate optimally. To achieve reliability
of test results at a certain ability level, a number of items of the
appropriate difficulty is required. To assess the ability of persons in a
group varying in ability, a number of items of varying difficulties is
necessary. If variation in ability is high, tests will become too long. In
such circumstances a series of tests of varying difficulty is more efficient.
The importance of correct level assignment of tests can be seen in figures 2,
3 and 4. Figure 2 presents the item characteristic curve (I.C.C.) for any
item. The sum of the I.C.C.'s of all items in a test yields the test
characteristic curve (T.C.C.), which theoretically has the same shape.
The point is that each item, and each test, has its highest discriminatory
power in one particular region of the ability continuum only and will lose
discriminatory power in groups exhibiting either too low or too high ability.
This phenomenon is often referred to as floor and ceiling effect. Figure 3
presents the T.C.C. of a pretest actually administered to two groups of
different ability levels. The test clearly discriminates better in group A,
which means its results will be more accurate, reliable and significant. The
reverse case is presented in figure 4; here the effect on KR20 reliability is
even more dramatic.

The method proposed to assess fitness will be demonstrated using Cito tests of
foreign language listening comprehension for the two highest levels of
secondary education in the Netherlands.



2 Material 

Cito tests of foreign language listening comprehension have been constructed
at Cito and used in final examinations in schools for over ten years. The tests
were originally developed in a research project at the University of Utrecht
(Groot, 1975). The objective of the tests, as defined by Groot, is to evaluate
foreign language learners' ability to understand the foreign language, spoken
spontaneously by educated native speakers at normal conversational tempo.
Extremely informal elements, as well as lexical and syntactical elements that
are incomprehensible to less educated native speakers, and topics requiring
specific knowledge are excluded from the language material in the tests. The
language material consists of three or four interviews with native speakers of
various occupations and professions; for the highest level, a lecture on a
subject of general interest is also included. The item format is a
multiple choice question with three options. The language material is divided
into samples of forty to fifty seconds; the correct answer is a one-phrase
summary of the global contents of the sample. A pause of twenty to twenty-five
seconds is provided on the tape between samples to allow the testees to
read the item in their test booklet and tick their answer on an answer sheet.
Test length varies from 40 to 50 items depending on school level.
Administration time varies accordingly between 50 and 60 minutes. Test
reliability (KR20) ranges from .70 to .85, with a noticeable tendency towards
the higher values in the last few years. Acceptability of the tests is clearly
reflected by the fact that more than 85 percent of the Dutch secondary schools
use them in their final examinations. Though this description applies to the
tests used in the present research design, since 1981 the scope of the tests
has been enlarged by using different item formats, e.g. modified cloze items,
allowing for a wider sample of language material in the tests (De Jong,
1983).



The two highest levels of secondary education in the Netherlands are HAVO and
VWO. HAVO is an intermediate level of general (non-vocational) education, meant
as a basis for further education preparing for higher non-academic functions.
HAVO is a five year programme. Foreign languages are taught at an average rate
of 3½ weekly periods. The final level is comparable to that obtained in most
western countries at the secondary school level.

VWO, the highest level of general education, is meant as a preparation for
further academic studies. VWO is a six year programme. Foreign languages are
taught at an average rate of three periods a week (Fig. 1).

The ability level required to understand a given sample of language material
is hard to define, as it is related to a complex of factors such as: range and
choice of vocabulary, grammatical complexity, articulation and other speech
features of the speaker, abstraction level of the message, density of
information and acoustical conditions. Though a number of these factors can be
measured, their effect on the difficulty of language material, and most
certainly the interrelated pattern of influences of different factors present
in varying degrees, remains uncertain. There is no satisfactory solution to
the measurement of language difficulty other than the holistic approach. In
practice test constructors have to rely on their intuition and the advice of a
board of teachers in order to match samples of language and educational
levels. A pretesting procedure including the assessment of the fitness of
tests for the level aimed at would constitute a holistic but reliable check on
the intuition-based level assignment, with the further advantage that the test
can be adapted to the required level by a careful selection of items.

3 Method 

The method is based on item response theory using the Rasch model (Rasch,
1960). In the Rasch model one item parameter and one person parameter
determine the probability that a given subject answers a given item correctly.
The item parameter in the Rasch model is called the difficulty parameter and
the person parameter is called the ability parameter. However, since for the
calibration of each test an arbitrary origin of measurement is selected, the
numerical values of the parameters of different tests cannot be compared
directly.
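
As an illustration of the model just described, the following minimal sketch
computes the Rasch probability of a correct answer and the expected test
score (the test characteristic curve) at one ability level. The function
names and the three item difficulties are hypothetical and are not taken from
the Cito calibrations.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability: float, difficulties: list[float]) -> float:
    """Test characteristic curve: sum of the item probabilities at one ability."""
    return sum(rasch_probability(ability, d) for d in difficulties)


# Hypothetical item difficulties in logits (illustrative values only).
difficulties = [-1.0, 0.0, 1.0]

# When ability equals difficulty the probability of success is one half.
print(rasch_probability(0.0, 0.0))
print(expected_score(0.0, difficulties))
```

Because only the difference between ability and difficulty enters the model,
adding the same constant to both leaves every probability unchanged, which is
why the origin of each calibration is arbitrary.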

In order to make them comparable the same origin has to be chosen for the
different tests. This procedure, known as test equating, can be followed if
the tests measure the same ability and if one of the following conditions is
met:

- the tests contain a number of common items;

- the groups of persons taking the tests contain common persons;
- the groups of persons taking the tests are representative samples from the
same population.

If both requirements are met and the different tests do not differ too much in
difficulty (Petersen et al., 1982), the item parameters of the tests and the
ability parameters of persons taking the tests can be equated.
Test equating in this study was done with a number of common items:
HAVO pupils took part of a VWO test and VWO pupils took part of a HAVO test,
in addition to each group taking its own test (Fig. 5). The items taken by both
groups constitute a so-called anchor test. After separate calibrations of each
test (including the anchor test) a pair of independently estimated parameters
was available for each anchor item. As a result of the sample independence of
the estimation, these parameters are equivalent except for an additive value
that, apart from sampling errors, is the same for all pairs in the anchor test.
Once this additive constant has been calculated, by taking the mean of the
differences within all pairs, it can be added to the item parameters of all
items in one test to bring that test onto the same scale as the other test in
the equating procedure.
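
The equating step described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the procedure as
implemented in the study: the function names and all difficulty values are
hypothetical, and the anchor items are assumed to be listed in the same order
in both calibrations.

```python
from statistics import mean


def equating_constant(anchor_a, anchor_b):
    """Mean of the pairwise differences between two calibrations of the anchor."""
    if len(anchor_a) != len(anchor_b):
        raise ValueError("both calibrations must cover the same anchor items")
    return mean(a - b for a, b in zip(anchor_a, anchor_b))


def to_common_scale(difficulties, constant):
    """Shift item difficulties by the additive equating constant."""
    return [d + constant for d in difficulties]


# Hypothetical anchor item difficulties (logits) from the two separate runs.
anchor_havo = [0.9, 1.4, 2.1, 1.6]   # anchor items as calibrated with HAVO data
anchor_vwo = [-0.1, 0.5, 1.0, 0.6]   # same items as calibrated with VWO data

shift = equating_constant(anchor_havo, anchor_vwo)

# Hypothetical non-anchor VWO items, moved onto the HAVO scale.
vwo_items_on_havo_scale = to_common_scale([-0.8, 0.0, 0.7], shift)
print(shift, vwo_items_on_havo_scale)
```

The direction of the shift is a convention: here the constant is the HAVO
value minus the VWO value, so it is added to VWO-calibrated parameters.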

Test analysis was done with the computer programme CALFIT (Wright and Mead,
1975).

The fit of the anchor items was evaluated by

$$(\sigma_{iH} - \sigma_{iV} - C_{H,V})^2 \, (N/12) \, [K/(K-1)], \qquad (1)$$

which is approximately $\chi^2$-distributed with 1 df (Wright and Stone, 1979).
Total anchor fit was evaluated by

$$\sum_{i=1}^{K} (\sigma_{iH} - \sigma_{iV} - C_{H,V})^2 \, (N/12) \, [K/(K-1)] \sim \chi^2, \qquad (2)$$

where

$\sigma_{iH}$, $\sigma_{iV}$ = parameter estimates for item i in the anchor for
the HAVO and VWO test, respectively,

$C_{H,V} = \sum_{i=1}^{K} (\sigma_{iH} - \sigma_{iV})/K$ = mean of the
differences within parameter pairs,

K = number of anchor items, and

N = number of pupils presented with the anchor part of the test.
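
A minimal sketch of how formulas (1) and (2) might be computed is given
below. The function name, the anchor values and the number of pupils are
hypothetical. The 5% critical value of 3.841 for a chi-square variable with
1 df is used here only as an illustrative screening threshold; the paper does
not state which criterion was applied.

```python
def anchor_fit(anchor_h, anchor_v, n_pupils):
    """Return the shift constant, the per-item fit (1) and the total fit (2)."""
    k = len(anchor_h)
    if k < 2 or k != len(anchor_v):
        raise ValueError("need at least two anchor items, paired in order")
    diffs = [h - v for h, v in zip(anchor_h, anchor_v)]
    shift = sum(diffs) / k                        # mean difference C_{H,V}
    scale = (n_pupils / 12.0) * (k / (k - 1))     # (N/12) [K/(K-1)]
    item_fit = [(d - shift) ** 2 * scale for d in diffs]
    return shift, item_fit, sum(item_fit)


# Hypothetical anchor calibrations (logits) and group size.
anchor_h = [0.9, 1.4, 2.1, 1.6]
anchor_v = [-0.1, 0.5, 1.0, 0.6]
shift, item_fit, total_fit = anchor_fit(anchor_h, anchor_v, n_pupils=300)

CHI2_1DF_05 = 3.841  # assumed screening threshold, not taken from the paper
misfitting = [i for i, f in enumerate(item_fit) if f > CHI2_1DF_05]
print(shift, item_fit, total_fit, misfitting)
```

An anchor item whose pair of estimates deviates strongly from the common
shift would show a large value of (1) and could be dropped from the anchor
before the constant is recalculated.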

After bringing both sets of parameters onto a common scale, the amount of
information yielded by either test, and thus the accuracy with which it
measures at any particular ability level, can be calculated with the formula

$$I(\xi_i) = \sum_{j=1}^{k} \frac{\exp(\xi_i - \sigma_j)}{[1 + \exp(\xi_i - \sigma_j)]^2}, \qquad (3)$$

where $\xi_i$ is the ability parameter for subject i, $\sigma_j$ is the
difficulty parameter for item j, and k the number of items.

The higher the information function, the more reliably decisions can be made.
The standard error of measurement at a given ability level can be calculated
directly from the information function with

$$S_E(\xi_i) = \frac{1}{\sqrt{I(\xi_i)}}. \qquad (4)$$

In order to assess the fitness of a test for a particular school type, the
distribution of ability within that school type has to be known. The
information function reveals the area where the test measures best; the
ability distribution reveals whether this area is important in the school type
in question.

Of course, purposes of testing can vary widely. The aim might be, for example,
to select the highest one percent of the distribution in order to award
scholarships, or else to select the lowest 25 percent in order to deny
entrance. For the tests under investigation here the primary purpose is to
award the qualifications 'sufficient' or 'insufficient': pass or fail.
The pass-fail borderline, or cut-off point, generally lies somewhat to the
left of the mode of the distribution. It seems natural to demand that
accuracy of measurement is highest in that region, where these most important
decisions are made.

In view of the reasons mentioned above, the choice of items should be such
that the sum of their information functions, i.e. the test information
function, reaches an optimum in the desired region of the ability
distribution. With Rasch-calibrated items the maximum is N x 0.25, where N is
here the number of items (k in formula 3). This maximum is reached when all
items have a difficulty that exactly equals the chosen ability level, since in
that case every term of (3) equals exp(0)/[1 + exp(0)]^2 = 0.25 and (3)
becomes N x 0.25. It is in this sense that a test should consist of items best
suited for a particular level.

The distribution of true ability within the HAVO and VWO group has been
calculated with a computer programme developed by Verstralen (1982). In
general practice, as in this case too, the distribution of true ability does
not differ much from the observed distribution of ability, provided tests are
sufficiently reliable and samples are sufficiently large. In this paper
therefore the observed distribution is used whenever it is more convenient.



4 Results and discussion

Figure 6 presents the information function of a pretest of French listening
comprehension meant for the VWO level, together with an indication of the
observed distribution of ability in the VWO group. The pretest consists of 75
items, which is, deliberately, too long for use under examination conditions.
A selection of the most appropriate subset of 50 items will have to be made.
The pretest is off target, as the information function reaches its maximum in
an area where VWO pupils are not represented. An average random selection of
50 items from the 75 pretested items would reduce test length only, without
improving the information function. By a careful selection, however, of items
at the difficulty level corresponding to the ability level at the cut-off
point, it is possible to achieve maximum information exactly in the relevant
area where pass/fail decisions will be made.

The selected items constitute a test considerably shorter than the original
pretest, yet it yields almost the same amount of information for
determining the ability of most pupils in the target group.
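
The selection principle described above can be sketched in a few lines.
Under the Rasch model the information an item contributes at a given ability
decreases with the distance between its difficulty and that ability, so the
sketch simply keeps the items closest to the cut-off. This is an assumed
selection rule for illustration; the paper does not specify the exact
procedure, and the nine-item pretest and the cut-off value are hypothetical.

```python
import math


def item_information(ability, difficulty):
    """Rasch item information, the summand of formula (3)."""
    e = math.exp(ability - difficulty)
    return e / (1.0 + e) ** 2


def select_items(difficulties, cutoff, n_items):
    """Indices of the n_items whose difficulty lies closest to the cut-off."""
    ranked = sorted(range(len(difficulties)),
                    key=lambda i: abs(difficulties[i] - cutoff))
    return sorted(ranked[:n_items])


def information_at(ability, difficulties, indices):
    """Test information at one ability for the items with the given indices."""
    return sum(item_information(ability, difficulties[i]) for i in indices)


# Hypothetical pretest difficulties (logits) and cut-off ability.
pretest = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
cutoff = 1.0

chosen = select_items(pretest, cutoff, 5)
info_selected = information_at(cutoff, pretest, chosen)
info_first_five = information_at(cutoff, pretest, range(5))
print(chosen, info_selected, info_first_five)
```

In this toy example the five targeted items yield roughly twice the
information at the cut-off that the five easiest items do, which mirrors the
contrast the paper draws between careful and arbitrary selection.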

Figure 7 presents the information functions of the 50 items selected for the
actual tests of French listening comprehension at the HAVO and the VWO level
on the same scale, together with the estimated distributions of true ability
for both populations.

The HAVO and VWO populations are sufficiently different to merit separate
tests. For example, items which the modal VWO pupil will pass about 50% of the
time will be answered correctly less than 30% of the time by the modal
HAVO pupil. On the other hand, the difference between both populations is not
so large as to endanger the equating procedure. There is less than one z-score
difference between the means of both populations on either test. The maxima of
the test information functions of both levels practically coincide with the
region where pass/fail decisions will be made.

Figure 8 presents indications of the observed distributions of ability in HAVO
and VWO schools for French, German and English listening comprehension. The
ability continuum shows how the observed ability is transformed into grade
marks at both levels. In the Netherlands the conversion of scores, i.e.
observed ability, into grade marks is still calculated in relation to test
results. The cut-off point is chosen at roughly 20% of the cumulative
frequency and assigned the mark 5.5; the maximum score is always assigned 10.
This implies that, whatever the range in ability within a certain school type,
marks will always be distributed from 1.0 to 10.0.

If the range is small, as is the case for French, the difference in ability
between pupils earning low and high marks will also be small, whereas if the
range is larger, as for English, it takes a greater difference in ability to
obtain a higher mark. The range in French is probably smaller because only 30
to 40% of the pupils take French as a subject, whereas English is taken by
over 90% of the population. German takes an intermediate position with 40 to
50%. Another effect of the wider distribution in both HAVO and VWO English is
the larger amount of overlap in ability between both populations, which in
turn causes a smaller difference between marks given at the HAVO and the VWO
level. The average HAVO pupil would have a fair chance of passing an English
VWO test, but would certainly fail in French.

It would only be fair to let HAVO pupils present themselves at VWO exams in
English. However, as long as the cut-off point is determined in relation to
test results, this would necessarily entail lowering VWO standards. It would
therefore be advisable to replace the relative determination of cut-off points
by an absolute method. By equating tests of subsequent years and choosing a
fixed cut-off point on the ability scale, the standard for passing would be
made independent of the group of pupils going in for an examination. It would
seem reasonable to take the average cut-off point of a number of recent years
as the new absolute cut-off point.

Furthermore, it should equally be discussed whether the distinction in
school types holds for all subjects in the same way. Given the much wider
distribution of ability in VWO English than in VWO French, one could, for
example, consider distinguishing two separate levels within the VWO population
for English. The determination and evaluation of school curricula is
ultimately a political issue, but new methods in test analysis can offer sound
criteria for decisions at the political level.